Skip to content

fix(runtime): autosave engine-implicit RuntimeCache via atexit + weakref#4362

Open
tp5uiuc wants to merge 3 commits into
pytorch:mainfrom
tp5uiuc:fix/runtime-cache-del-shutdown
Open

fix(runtime): autosave engine-implicit RuntimeCache via atexit + weakref#4362
tp5uiuc wants to merge 3 commits into
pytorch:mainfrom
tp5uiuc:fix/runtime-cache-del-shutdown

Conversation

@tp5uiuc

@tp5uiuc tp5uiuc commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Summary

Original bug from #4359: engine-implicit RuntimeCache handles relied solely on __del__ to autosave at process exit. When the engine survived to exit, __del__ fired during shutdown and the torchbind / filelock lazy imports raised ImportError: sys.meta_path is None, surfacing as Exception ignored in __del__ once per surviving handle.

Fixes

  • Move the autosave anchor to atexit.register(partial(_autosave_at_exit, weakref.ref(self))). atexit fires before module teardown; __del__
    stays as the mid-program GC path. Both flip autosave_on_del off so
    whichever runs first wins. __del__ unregisters its token to keep the
    atexit registry from accumulating dead entries.
  • Module-level helper closing over a weakref keeps the registration
    non-owning, so handles can still die mid-program normally.
  • RuntimeCache.__getstate__ / __setstate__ strip the unpicklable
    weakref from the pickle stream and re-register a fresh atexit hook
    on unpickle when autosave_on_del was on. __del__ reads the token
    via getattr(..., None) so copy.deepcopy partial-init edge cases
    stay silent.

Dependency on a sibling PR

This PR introduces an _atexit_token slot (partial(..., weakref.ref(self))) on RuntimeCache. CI tests that pickle compiled modules (test_mutable_torchtrt_module::test_save, test_cross_runtime_serde::test_save_*) walk through _runtime_settings.runtime_cache and would hit the weakref. The matching wrapper-side pickle exclusion lives in #4368 and should land first to unblock CI here.

Fixes #4359.

Type of change

  • Bug fix (non-breaking)

Test plan

TestRuntimeCacheAutosave in tests/py/dynamo/runtime/test_000_runtime_cache.py:

  • test_del_swallows_shutdown_import_error_on_pathsys.unraisablehook sees no leak
  • test_atexit_hook_saves_via_weakref — helper saves + flips flag
  • test_atexit_hook_no_op_on_dead_weakref — dead weakref no-ops cleanly
  • test_atexit_token_unregistered_after_delatexit.unregister called with the registered token
  • test_pickle_round_trip_strips_atexit_token — standalone-pickle round-trip rewires a fresh hook (stubbed handle to isolate from the sibling PR's _RuntimeCacheHandle picklability work)
  • pre-commit run clean on touched files (mypy skipped for two pre-existing errors at _TorchTensorRTModule.py:380 and :508, unrelated)

@meta-cla meta-cla Bot added the cla signed label Jun 24, 2026
@github-actions github-actions Bot added component: tests Issues re: Tests component: api [Python] Issues re: Python API labels Jun 24, 2026
@github-actions github-actions Bot requested a review from lanluo-nvidia June 24, 2026 17:21
@tp5uiuc tp5uiuc self-assigned this Jun 24, 2026
@tp5uiuc tp5uiuc marked this pull request as draft June 24, 2026 17:22
@tp5uiuc tp5uiuc force-pushed the fix/runtime-cache-del-shutdown branch from 6d02dad to a0a67d7 Compare June 25, 2026 00:53
@tp5uiuc tp5uiuc changed the title fix(runtime): silence ImportError from RuntimeCache.__del__ fix(runtime): autosave engine-implicit RuntimeCache via atexit + weakref Jun 25, 2026
@tp5uiuc tp5uiuc force-pushed the fix/runtime-cache-del-shutdown branch from a0a67d7 to a3770be Compare June 25, 2026 01:52
The previous `__del__`-only autosave path silently lost cache updates when
the engine survived until interpreter exit (typical for inference servers).
Python tears down `sys.meta_path` early in shutdown; the torchbind
attribute access in `self._handle.path` and the lazy `filelock` import
inside `save()` then raise `ImportError: sys.meta_path is None`, which
escaped `__del__` and surfaced as a noisy `Exception ignored in __del__`
once per surviving handle.

`atexit` callbacks run *before* module teardown, so the torchbind path
and lazy imports still resolve there. Register an `atexit` hook from
`__init__` whenever `autosave_on_del=True`. The hook closes over a
`weakref.ref(self)` so it doesn't pin the handle alive: a handle that
dies mid-program still goes through `__del__`, and the atexit hook later
sees a dead weakref and no-ops.

Other design points worth calling out:

* `_autosave_at_exit` is a module-level helper, not a bound method. A
  bound method captures `self` via `__self__`, which would defeat the
  weakref. The free function lets the closure carry only the weakref.

* Both `__del__` and the atexit hook flip `autosave_on_del` off before
  saving so whichever path runs first wins and the other no-ops -- no
  double-save, no double-leak risk.

* `__del__` unregisters its atexit token. Without this, a long-running
  process that churns engine-implicit handles (model swaps, A/B
  rollouts) accumulates dead atexit entries -- small per entry but
  unbounded.

* The `try` in `__del__` still wraps the whole body so any residual
  attribute-access failure during late-shutdown corner cases is swallowed
  rather than leaking to `sys.unraisablehook`.

Tests added in `TestRuntimeCacheAutosave`:
- `test_del_swallows_shutdown_import_error_on_path`: monkey-patches
  `_handle.path` to raise the shutdown `ImportError`; asserts via
  `sys.unraisablehook` that nothing leaks.
- `test_atexit_hook_saves_via_weakref`: exercises the helper directly,
  verifies it saves and flips `autosave_on_del`.
- `test_atexit_hook_no_op_on_dead_weakref`: dead weakref => no-op, no
  exception.
- `test_atexit_token_unregistered_after_del`: `atexit.unregister` is
  spied to confirm `__del__` cleaned up.

Refs pytorch#4359
@tp5uiuc tp5uiuc force-pushed the fix/runtime-cache-del-shutdown branch from a3770be to eacb80e Compare June 25, 2026 02:10
@tp5uiuc tp5uiuc marked this pull request as ready for review June 25, 2026 02:12
@tp5uiuc tp5uiuc requested a review from cehongwang June 25, 2026 02:12
@github-actions github-actions Bot added component: core Issues re: The core compiler component: runtime component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths labels Jun 26, 2026
Comment thread py/torch_tensorrt/runtime/_runtime_cache.py
except Exception:
pass

def __getstate__(self) -> dict[str, Any]:

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers : these are added so that users calling pickle.save(RuntimeCache()) works as expected.

The atexit + weakref design from the previous commit added an
``_atexit_token`` slot (``partial(_autosave_at_exit, weakref.ref(self))``)
to ``RuntimeCache``. ``weakref.ref`` is unpicklable, so any caller that
pickled a ``RuntimeCache`` directly hit
``TypeError: cannot pickle 'weakref.ReferenceType'``.

Two defenses:

* ``__getstate__`` / ``__setstate__`` strip the token from the pickle
  stream and re-register a fresh atexit hook on unpickle when
  ``autosave_on_del`` was on. Atexit registration is a per-process
  artifact; the loading process gets its own hook bound to the loading-
  side instance.

* ``__del__`` reads ``_atexit_token`` via ``getattr(..., None)``.
  ``__init__`` sets the slot first, but ``copy.deepcopy`` can crash
  mid-state-copy on an unrelated field (e.g. the python-runtime
  ``_RuntimeCacheHandle._lock``, which is ``threading.Lock``) and leave
  the new instance with only some attributes set. The defensive read
  keeps ``__del__`` quiet rather than raising ``AttributeError`` to
  ``sys.unraisablehook``.

Test ``test_pickle_round_trip_strips_atexit_token`` exercises the
standalone-pickle path with a ``SimpleNamespace`` stub for ``_handle``
so the test isolates the ``__getstate__``/``__setstate__`` logic from
the orthogonal limitation that ``_RuntimeCacheHandle`` (python rt) is
itself unpicklable today.

The wrapper-side fix that prevents engine-implicit handles from being
pickled via ``torch.save(module)`` at all -- plus the matching
``_RuntimeCacheHandle`` picklability fix on python-rt -- is intentionally
a separate change; this commit only addresses pickle safety of the
``_atexit_token`` slot introduced in the prior commit.

Refs pytorch#4359
…helper

``__init__`` and ``__setstate__`` both registered the same
``atexit.register(partial(_autosave_at_exit, weakref.ref(self)))``
call. Hoist into ``_register_atexit_autosave`` and reuse from both
sites. The idempotency check (``_atexit_token is None``) moves into the
helper, so a second call is a no-op regardless of caller.

Behavior is unchanged. Addresses review comment on
py/torch_tensorrt/runtime/_runtime_cache.py:227.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed component: api [Python] Issues re: Python API component: core Issues re: The core compiler component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths component: runtime component: tests Issues re: Tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🐛 [Bug] Serialization fails in Runtime Cache

1 participant